library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(caret)
library(plotly)
library(randomForest)
library(nnet)
We will now focus our attention on seagrass, and aim to build a classification algorithm to predict presence or absence of any seagrass species in the Great Barrier Reef. We will use location, types of sediment, and types of seabed to predict the presence of four seagrass species: Cymodocea serrulata, Syringodium isoetifolium, Thalassia hemprichii, and Zostera muelleri (subspecies capricorni).
# read data
setwd("cleaned_data")
seagrass <- read.csv("seagrass_classification_data.csv", as.is =TRUE)
# factorize relevant variables
seagrass$SPECIES <- as.factor(seagrass$SPECIES)
seagrass$SEDIMENT <- as.factor(seagrass$SEDIMENT)
seagrass$TIDAL <- as.factor(seagrass$TIDAL)
# view data
head(seagrass)
## SPECIES LATITUDE LONGITUDE DEPTH SEDIMENT TIDAL
## 1 Z_CAPIRCOR -23.65857 151.1206 0 Mud intertidal
## 2 Z_CAPIRCOR -23.65840 151.1212 0 Mud intertidal
## 3 Z_CAPIRCOR -23.65898 151.1203 0 Mud intertidal
## 4 Z_CAPIRCOR -23.65970 151.1197 0 Mud intertidal
## 5 Z_CAPIRCOR -23.65884 151.1205 0 Mud intertidal
## 6 Z_CAPIRCOR -23.65993 151.1197 0 Mud intertidal
summary(seagrass)
## SPECIES LATITUDE LONGITUDE DEPTH
## C_SERRULAT: 1256 Min. :-24.20 Min. :142.5 Min. : 0.0000
## S_ISOETIFO: 101 1st Qu.:-23.79 1st Qu.:146.8 1st Qu.: 0.0000
## T_HEMPRICH: 823 Median :-22.41 Median :150.5 Median : 0.0000
## Z_CAPIRCOR:10201 Mean :-21.06 Mean :149.0 Mean : 0.9253
## 3rd Qu.:-19.17 3rd Qu.:151.3 3rd Qu.: 1.2279
## Max. :-10.75 Max. :151.9 Max. :29.3583
##
## SEDIMENT TIDAL
## Gravel: 1 intertidal:7433
## Mud :8969 subtidal :4948
## Reef : 63
## Rock : 66
## Rubble: 107
## Sand :3151
## Shell : 24
First we plot seagrass according to location (latitude and longitude). Then we will add a third axis (water depth) to visualize this in 3-dimensions using the plotly package. Since depth is measured in meters below sea level, we visualize this in negative values.
seagrass %>%
ggplot() +
geom_point(aes(x=LATITUDE, y=LONGITUDE, color=SPECIES)) +
ggtitle("Australia, Northeast Coast (Great Barrier Reef)")
plot_ly(seagrass, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
We can also use histograms to see how our categorical predictors, sediment and seabed type, relate to the number of observed seagrass species:
seagrass %>%
ggplot(aes(SEDIMENT, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
seagrass %>%
ggplot(aes(TIDAL, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
Our first attempt at the classifier (to predict presence of any seagrass species) will use random forest. We will first partition the data set into a training and testing set. Since we have over 12,000 observations, we can allocate 75% of the data to the training set. We will fit the model using 500 trees and 2 variables to split at each node (~p/4).
# train-test split
seagrass_train_ind <- createDataPartition(y = seagrass$SPECIES, p=0.75, list=FALSE)
train_set <- seagrass[seagrass_train_ind, ]
test_set <- seagrass[-seagrass_train_ind, ]
# fit RF using relevant features, with mtry~p/3
rf_fit <- randomForest(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT + TIDAL,
data=train_set, mtry = 2)
# build vector of predictions
rf_pred <- predict(rf_fit, newdata = test_set, type = "response")
# view performance metrics
confusionMatrix(table(pred = rf_pred, true = test_set$SPECIES))
## Confusion Matrix and Statistics
##
## true
## pred C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
## C_SERRULAT 229 11 21 18
## S_ISOETIFO 5 7 3 1
## T_HEMPRICH 4 2 174 0
## Z_CAPIRCOR 76 5 7 2531
##
## Overall Statistics
##
## Accuracy : 0.9505
## 95% CI : (0.9423, 0.9579)
## No Information Rate : 0.8242
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8291
##
## Mcnemar's Test P-Value : 5.783e-11
##
## Statistics by Class:
##
## Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity 0.72930 0.280000 0.84878
## Specificity 0.98201 0.997067 0.99792
## Pos Pred Value 0.82079 0.437500 0.96667
## Neg Pred Value 0.96980 0.994152 0.98936
## Prevalence 0.10149 0.008080 0.06626
## Detection Rate 0.07401 0.002262 0.05624
## Detection Prevalence 0.09017 0.005171 0.05818
## Balanced Accuracy 0.85566 0.638534 0.92335
## Class: Z_CAPIRCOR
## Sensitivity 0.9925
## Specificity 0.8382
## Pos Pred Value 0.9664
## Neg Pred Value 0.9600
## Prevalence 0.8242
## Detection Rate 0.8180
## Detection Prevalence 0.8465
## Balanced Accuracy 0.9154
We see that our classification model works quite well, especially for T_HEMPRICH and Z_CAPRICOR, which have 85%+ sensitivity and specificity. However, this model got quite a low sensitivity for S_ISOETIFO. Recall from our data wrangling process that S_ISOETIFO had only about 100 “Yes” observations. Since we had such low variation in the values of S_ISOETIFO relative to the other three seagrass species, this may have contributed to the low sensitivity.
We can visually assess our predicted values with true values of species to see how our model performed.
# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~rf_pred, type="scatter3d", mode="markers")
These plots look fairly similar across the training and test sets, for all four species.
Below we see that across sediment types, the species predictions (rf_pred) appear very similar to the true species distribution (SPECIES). The same is true across seabed types, in the two bar graphs further below.
test_set %>%
ggplot(aes(SEDIMENT, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(SEDIMENT, fill=rf_pred)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(TIDAL, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(TIDAL, fill=rf_pred)) +
geom_bar(width=.5, position = "dodge")
Finally, we examine variable importance using the Gini coefficient:
variable_importance <- importance(rf_fit)
tmp <- data_frame(feature = rownames(variable_importance),
Gini = variable_importance[,1]) %>%
arrange(desc(Gini))
tmp %>% ggplot(aes(x=reorder(feature, Gini), y=Gini)) +
geom_bar(stat='identity') +
coord_flip() + xlab("Feature") +
theme(axis.text=element_text(size=8))
We see that longitude and latitude were very predictive of presence of seagrass, followed by water depth. The types of sediment and seabed are not very important predictors. Thus, it seems that the location of where the sea grass was discovered matters more than the various ocean floor properties.
We will now try a multinomial logistic regression model to see how it compares to the random forest we fit above. We will use the nnet package. The logistic regression model is as follows:
# fit model
multinom_fit <- multinom(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, data=train_set)
## # weights: 44 (30 variable)
## initial value 12874.515732
## iter 10 value 3238.252953
## iter 20 value 2574.820181
## iter 30 value 2476.580036
## iter 40 value 2412.749553
## iter 50 value 2398.326504
## iter 60 value 2366.750453
## iter 70 value 2366.650134
## iter 80 value 2366.518268
## iter 90 value 2364.884233
## iter 100 value 2364.752964
## final value 2364.752964
## stopped after 100 iterations
summary(multinom_fit)
## Call:
## multinom(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT,
## data = train_set)
##
## Coefficients:
## (Intercept) LATITUDE LONGITUDE DEPTH SEDIMENTMud
## S_ISOETIFO -410.4873 2.0669651 2.665179 0.006176328 53.94191
## T_HEMPRICH -171.5170 1.9202662 1.294884 -0.329414604 13.61520
## Z_CAPIRCOR -278.8529 0.7096458 1.510618 -0.784794642 72.97421
## SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 54.99932 54.52452 -20.67969 54.77416 55.67214
## T_HEMPRICH 16.91114 17.86782 18.53850 16.89500 16.44980
## Z_CAPIRCOR 69.07350 70.34708 69.10061 72.10102 71.44704
##
## Std. Errors:
## (Intercept) LATITUDE LONGITUDE DEPTH SEDIMENTMud
## S_ISOETIFO 0.002560385 0.05895299 0.008095859 0.02446157 0.4020666
## T_HEMPRICH 0.001435543 0.05912903 0.007588627 0.04259597 0.3336095
## Z_CAPIRCOR 0.002651486 0.02880042 0.004656218 0.02864185 0.3578270
## SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 0.6400785 0.9298980 4.975118e-14 0.3747139 0.9600996
## T_HEMPRICH 0.5702821 0.6293914 4.956410e-01 0.3011637 1.1222749
## Z_CAPIRCOR 0.9370497 0.7180181 1.132460e+00 0.3595152 0.9014128
##
## Residual Deviance: 4729.506
## AIC: 4789.506
# predicted probabilities
predicted_prob <- predict(multinom_fit, newdata=test_set, type="probs")
# predicted classes
predicted_class <- predict(multinom_fit, newdata=test_set, type="class")
# classification performance metrics
confusionMatrix(data = predicted_class, reference = test_set$SPECIES )
## Confusion Matrix and Statistics
##
## Reference
## Prediction C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
## C_SERRULAT 131 9 10 27
## S_ISOETIFO 0 0 0 0
## T_HEMPRICH 17 4 177 21
## Z_CAPIRCOR 166 12 18 2502
##
## Overall Statistics
##
## Accuracy : 0.9082
## 95% CI : (0.8975, 0.9182)
## No Information Rate : 0.8242
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6611
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity 0.41720 0.00000 0.86341
## Specificity 0.98345 1.00000 0.98546
## Pos Pred Value 0.74011 NaN 0.80822
## Neg Pred Value 0.93726 0.99192 0.99026
## Prevalence 0.10149 0.00808 0.06626
## Detection Rate 0.04234 0.00000 0.05721
## Detection Prevalence 0.05721 0.00000 0.07078
## Balanced Accuracy 0.70033 0.50000 0.92444
## Class: Z_CAPIRCOR
## Sensitivity 0.9812
## Specificity 0.6397
## Pos Pred Value 0.9274
## Neg Pred Value 0.8788
## Prevalence 0.8242
## Detection Rate 0.8087
## Detection Prevalence 0.8720
## Balanced Accuracy 0.8104
We see that our multinomial logistic model has about 91% overall accuracy, which indicates lower performance than the random forest. The model seems to predict T_HEMPRICH the best with 88.7% sensitivity and 98.8% specificity. It peforms well for Z_CAPRICOR as well, but performs relatively poorly for C_SERRULAT and even worse for S_ISOETIFO with 0% sensitivity. Again, we can assess our results visually:
# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~predicted_class, type="scatter3d", mode="markers")
In this case, the truth and prediction plots look different, particularly in that the prediction plot does not appear to show any instances of S_ISOETIFO. This makes sense, given the small sample size and low variability in the values of the S_ISOETIFO variable.
Below we see that across sediment types, the species predictions (rf_pred) appear fairly similar to the true species distribution (SPECIES). This is not true across seabed types, in the two bar graphs further below - there we see that S_ISOETIFO is never predicted, and C_SERRULAT is predicted less often than in the truth.
test_set %>%
ggplot(aes(SEDIMENT, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(SEDIMENT, fill=predicted_class)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(TIDAL, fill=SPECIES)) +
geom_bar(width=.5, position = "dodge")
test_set %>%
ggplot(aes(TIDAL, fill=predicted_class)) +
geom_bar(width=.5, position = "dodge")
The overall accuracy for the multinomial logistic regression model was 90.9%, and that of the random forest model was 95.6%. While these may seem fairly close, the accuracy for the multinomial logistic model is deceiving since it performed particularly poorly in terms of sensitivity in 2 out of 4 classes.
With approximately 96% overall accuracy, the random forest model performed better than the multinomial logistic regression model (with 91% accuracy) in predicting the species of a seagrass observation based on several environmental inputs. The best predictors of seagrass presence were latitude, longitude, and depth below sea level. Prediction of seagrass species was more closely related geographic location (latitude and longitude) than sediment type or tidal zone. However, despite the high overall accuracy of both models, insufficient observations of S. isoetifolium presence led to low performance of both models for predicting the presence of this seagrass species.
Knowledge gained from this model could be helpful to better understand which seagrass species are present in specific locations over time, which may be an important indicator of reef health. Seagrass ecology is a marker of plant diversity and presence or absence of specific species may change with environmental changes in the reef over time such as climate change. Over time, sea water temperature could lead to changes in seagrass ecology by altering tidal zones and sediment type, two predictors in our models. Even the information represented by geographic location could change over time, if the location variable reflects the combined effects air and water temperature, water quality, and ocean and air currents.